Introduction to R Programming

Welcome

Welcome to the Intro to R Programming Workshop!

Link to Slides: https://favstats.github.io/ds3_r_intro/

Workshop description:

This workshop focuses on the very beginnings of a great journey ahead of you: learning how to use and be comfortable with the statistical programming language R. Together we will explore the basics, from the working environment itself, creating functions for simplifying your tasks, to data management with the tidyverse package. The overarching goal of the workshop is for you to receive the necessary skill set that will enable you to soon embark on your own data science adventures. That being said, the most important aspect of the workshop will be to have fun along the way so that your journey can begin as smoothly and easily as possible.

R as Calculator

At its core R is just a fancy calculator

R can do:

  • + addition
  • - subtraction
  • * multiplication
  • / division
  • ^ exponentiation

Try it out

15 + 10
## [1] 25

Mix operators

(15 + 5) / (2 * 5)
## [1] 2

Note: when we put a hash tag in front of our code then that is a comment.

Comments are just a useful way for us to add some text to our code, for example to explain what we are doing.

## I will be ignored
2*2*2
## [1] 8
## And so will I

An animal data set appears..

But adding up numbers for no reason is no fun.

That’s why we will use a data set about animals to learn some R Basics.

Animal Ageing and Longevity Database (John Snow Labs)

Data on over 4200 animals.

Information on age of maturity, gestation or incubation periods but also longevity (in years).

Animal Lifespan Data

Say you want to know how old an animal is in human years.

We can use the following simple formula to determine that:

\(\frac{\text{Maximum lifespan human}}{\text{Maximum lifespan non-human animal}} = \text{animal to human years conversion ratio}\)

Note: This is just a very rough way to determine the conversion ratio. It is much more complicated in reality.

Why are we using longevity instead of.. let’s say average life expectancy?

According to the Animal Ageing and Longevity Database (John Snow Labs):

“[…] [M]aximum longevity (also called maximum lifespan) […] is the most widely used parameter for comparing the rate of aging between species.”

First some important data points.

Human Longevity

The longest human to ever live is a woman named: Jeanne Calment. The french woman reached the age of 122 (!). She was born in 1875 and lived until 1997.

Animal Longevity Stats

Here is a table of maximum longevity for various animals:

Animal Maximum Longevity (in years)
Human 122.5
Domestic dog 24.0
Domestic cat 30.0
American alligator 77.0
Golden hamster 3.9
King penguin 26.0
Lion 27.0
Greenland shark 392.0
Galapagos tortoise 177.0
African bush elephant 65.0
California sea lion 35.7
Fruit fly 0.3
House mouse 4.0
Giraffe 39.5
Wild boar 27.0

Source: Animal Ageing and Longevity Database.

Dog to Human years

Calculating the animal to human conversion ratio

human_lifespan <- 122.5

dog_lefespan <- 24

So for every year a human ages, a dog “ages” 5.1 human years.

How old is a 15 year old dog in human years?

human_lifespan
## [1] 122.5

R Objects

Now it can be quite tedious to juggle all those numbers around.

Especially if we want to keep reusing numbers we calculated before.

Here to simplify that process are: R objects.

You can think of R objects as saving information, for example simple numbers or just plain text.

In fact it is good to remember that:

\(\color{blue}{\text{"Everything that exists in R is an object."}}\) ~John M. Chambers

By creating an object we save the information to our working environment and we can recall it whenever we want by just running the name of the object.

We create R objects by using the assignment operator:

<-

or

=

tidyverse style guide recommends to use <- for assignment so that is what we are going to use here.

Here is an example using the small exercise from earlier:

  1. First, we assign the human max lifespan to a variable called human_lifespan. The we assign the maximum lifespan of a dog to dog_lifespan.
human_lifespan  <- 122.5
dog_lifespan <- 24

If we now run the respective objects we retrieve the saved numbers.

human_lifespan
## [1] 122.5
dog_lifespan
## [1] 24

Now we can perform the same calculation as before but this time using objects!

dogs_to_human <- human_lifespan / dog_lifespan

The variable dogs_to_human now holds the dog to human years conversion ratio.

dogs_to_human
## [1] 5.104167

Now we ask again: how old is a 15 year old dog in human years?

dogs_to_human*15
## [1] 76.5625

Same result as earlier!

Cat to Human years

  1. Calculate cats to human years conversion ratio using objects
  2. How old is a 15 year old cat in “human years”?
catslifespan <- 30

human_lifespan
## [1] 122.5
cats_to_humanlifespan <- human_lifespan/catslifespan

cats_to_humanlifespan
## [1] 4.083333

For every human year for a cat represents four years

Now we are gonna calculate what how fifteen human years translates to a cat life span

cats_to_humanlifespan * 15
## [1] 61.25

Note that object names could be anything here! I chose to go with short but self-explanatory names but I could have chosen to just name them x, y or z.

A note on naming stuff

\[\color{blue}{\text{"There are only two hard things in Computer Science:}}\] \[\color{blue}{\text{cache invalidation and naming things. ~ Phil Karlton"}}\]

Artist: Allison Horst

There are many approaches to name things in programming.

In this case we will be using a technique called lower snake case, which is also part of the tidyverse style guide.

Use underscores (_) to separate lower case letter words within a name.

Example:

# Recommended
animal_lifespan

# Not recommended
AnimalLifespan
animal.lifespan

Logical Operators

But wait there are more operators! R knows a ton of them.

Logical operators are used for logical tests which can result in either:

\(\text{TRUE}\) or \(\text{FALSE}\)

(sometimes this is also called a boolean variable)

Let’s first create some more objects to try some logical tests!

lion_lifespan <- 27
mouse_lifespan <- 4
fly_lifespan <- 0.3
boar_lifespan <- 27
alligator_lifespan <- 77
greenland_shark_lifespan <- 392
galapagos_tortoise_lifespan <- 177

Equal

== asks whether two values are the same or equal

lion_lifespan == boar_lifespan
## [1] TRUE

The code above tests the following statement: > The lifespan of a lion equals the lifespan of a boar.

Since both maximum lifespans are 27 this is of course a TRUE statement.

Not Equal

!= asks whether two values are the not the same or unequal

lion_lifespan != boar_lifespan
## [1] FALSE

The code above tests the reverse of the above statement: > The lifespan of a lion does not equal the lifespan of a boar.

Since both maximum lifespans are 27 this is of course a FALSE statement.

Greater or Smaller

We can also test whether certain values are greater or smaller than others:

> greater than

human_lifespan > fly_lifespan
## [1] TRUE

The code above tests the statement: > The lifespan of a human is greater than the lifespan of a fly.

Since the maximum human lifespan is 122.5 and a fly does not live longer than 0.3 years this is of course a TRUE statement.

alligator_lifespan < mouse_lifespan
## [1] FALSE

The code above tests the statement: > The lifespan of an alligator is smaller than the lifespan of a mouse.

Since the maximum alligator lifespan is 77 and a mouse lives for 4 years maximum, this is of course a FALSE statement.

Also note the following options:

  • >= greater or equals
  • <= smaller or equals
20 >= 20
## [1] TRUE

Combine Logical Operators

We can also combine logical tests by testing multiple statements at the same time:

  • & stands for “and” (unsurprisingly)
  • | stands for “or”

For example both alligator_lifespan and fly_lifespan have to be greater than mouse_lifespan for the code below to evaluate as TRUE.

alligator_lifespan > mouse_lifespan & fly_lifespan > mouse_lifespan
## [1] FALSE

If we say | (= or) instead, it means either statement evaluation to TRUE is enough!

alligator_lifespan > mouse_lifespan | fly_lifespan > mouse_lifespan
## [1] TRUE

Vectors

Now we learned about objects. But so far they have only ever held a single numeric value.

R is of course much more powerful than that and objects can hold any number and types of data.

Now we will take a look at vectors, or objects that include more than one value.

You can simply imagine vectors as a list of values. They can consist of numbers but also strings (or: text).

In order to create a vector in R we make use of c() (which stands for concatenate)

c(1, 2, 3, 4)
## [1] 1 2 3 4

We can combine the lifespans we assigned into objects earlier:

animal_lifespan <- c(greenland_shark_lifespan, dog_lifespan, 
  galapagos_tortoise_lifespan,
  mouse_lifespan, fly_lifespan,
  lion_lifespan, boar_lifespan,
  alligator_lifespan, human_lifespan)

animal_lifespan
## [1] 392.0  24.0 177.0   4.0   0.3  27.0  27.0  77.0 122.5

We can also create a vector of strings by using quotes:

c("greenland_shark", "dog", 
  "galapagos_tortoise", "mouse", 
  "fly", "lion", "boar",
  "alligator", "human")
## [1] "greenland_shark"    "dog"                "galapagos_tortoise"
## [4] "mouse"              "fly"                "lion"              
## [7] "boar"               "alligator"          "human"

Of course we can also assign these vectors to objects.

Let’s say we have a vector animal_lifespans with different life spans of animals.

animal_lifespans <- c(greenland_shark_lifespan, dog_lifespan, 
  galapagos_tortoise_lifespan,
  mouse_lifespan, fly_lifespan,
  lion_lifespan, boar_lifespan, 
  alligator_lifespan, human_lifespan)

Now if we wanted to get the different year conversation ratios we can simply divide the maximum human age number by the vector.

human_lifespan/animal_lifespans
## [1]   0.3125000   5.1041667   0.6920904  30.6250000 408.3333333   4.5370370
## [7]   4.5370370   1.5909091   1.0000000

Notice how the operation is performed for each item separately and the result is yet another vector.

animal_to_human_ratios <- human_lifespan/animal_lifespans

We can also use logical operators with vectors:

animal_to_human_ratios < 1
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Again, notice how the operation is performed for each item separately and the result is yet another vector, this time consisting of TRUE and FALSE.

Small note on Variable Types

There are four main types of variables:

  • logical: Boolean/binary, is either TRUE or FALSE
class(TRUE)
## [1] "logical"
  • character (or string): simple text, including symbols and numbers "text"
class("I am a character")
## [1] "character"
  • numeric: Numbers. Mathmatical operators can be used here.
class(2020)
## [1] "numeric"
  • factor: characters or strings but ordered in some way
class(as.factor(c("I", "am", "a", "factor")))
## [1] "factor"

Another important value to know is NA. It stands for “Not Available” and simply denotes a missing value.

There also more like dates and datetimes but we won’t focus on that now.

c(NA, 1, 2, NA)
## [1] NA  1  2 NA

A logical operator for vectors: %in%

An incredibly useful operator for vectors is %in%.

The operator checks whether multiple elements occur somewhere in your vectors.

This basic usage looks like this:

\(\color{red}{\text{vector1}}\) %in% \(\color{orange}{\text{vector2}}\)

Let’s first create a vector called animals which includes all the animals in the same

animals <- c("greenland_shark", "dog", 
  "galapagos_tortoise", "mouse", 
  "fly", "lion", "boar",
  "alligator", "human")

Let’s say we want to check whether giraffe, greenland_shark or lion occur in animals.

If we use | we would have to write something like this:

animals == "giraffe" | animals == "greenland_shark"  | animals == "lion"
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

With %in% we can simply pass a vector like this:

animals_to_check <-  c("giraffe", "greenland_shark", "lion")
animals %in% animals_to_check
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

Where (if at all) does greenlands_shark or giraffe occur within our animals vector?

animals %in%  c("greenland_shark", "giraffe") 
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This might not look too important of an improvement. But what if we had dozens or hundreds of animals to check?

animals == "honey_bee" | animals == "cardiocondyla_obscurior" | animals == "black_garden_ant" | animals == "pheidole_dentata" | animals == "squinting_bush_brown" | animals == "american_lobster" | animals == "firebelly_toad" | animals == "oriental_firebelly_toad" | animals == "yellow_bellied_toad" | animals == "greenland_shark" | animals == "western_toad" | animals == "yosemite_toad" | animals == "great_plains_toad" | animals == "green_toad" | animals == "canadian_toad" | animals == "red_spotted_toad" | animals == "sonoran_green_toad" | animals == "southern_toad" | animals == "veragoa_stubfoot_toad" | animals == "common_european_toad" | animals == "colorado_river_toad" | animals == "kihansi_spray_toad" | animals == "ridge_headed_toad" | animals == "cuban_toad" | animals == "european_green_toad" | animals == "colombian_giant_toad" | animals == "argentine_toad" | animals == "cane_toad" | animals == "eurura_toad" | animals == "common_horned_frog" | animals == "colombian_horned_frog" | animals == "amazonian_horned_frog" | animals == "ornate_horned_frog" | animals == "green_and_black_dart_poison_frog" 
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

With %in% it’s much more readable:

animals_to_check <-  c("honey_bee", "cardiocondyla_obscurior", "black_garden_ant", "pheidole_dentata", "squinting_bush_brown", "american_lobster", "firebelly_toad", "oriental_firebelly_toad", "yellow_bellied_toad",  "greenland_shark", "western_toad", "yosemite_toad", "great_plains_toad", "green_toad", "canadian_toad", "red_spotted_toad", "sonoran_green_toad", "southern_toad", "veragoa_stubfoot_toad", "common_european_toad",  "colorado_river_toad", "kihansi_spray_toad", "ridge_headed_toad", "cuban_toad", "european_green_toad","colombian_giant_toad", "argentine_toad", "cane_toad", "eurura_toad", "common_horned_frog", "colombian_horned_frog", "amazonian_horned_frog", "ornate_horned_frog")
animals %in% animals_to_check
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Indexing

Indexing is done via square brackets [] for when you want to know a specific value within your object.

\[\color{red}{\text{vector}}[\color{orange}{\text{elements}}]\]

Exracting the first element of a vector:

animal_lifespans[1]
## [1] 392
animals[1]
## [1] "greenland_shark"

Exracting the fifth element of a vector:

animal_lifespans[5]
## [1] 0.3
animals[5]
## [1] "fly"
animals[c(1, 3, 5)]
## [1] "greenland_shark"    "galapagos_tortoise" "fly"

Indexing with logical tests

You can also index using logical tests.

So if an expression evaluates to TRUE it will keep that element and when it evaluates to FALSE it will remove the element.

\[\color{red}{\text{vector}}[\color{orange}{\text{vector of TRUE/FALSE of same length}}]\]

Let’s first take a look at a logical test that extracts all animals that have greater lifespans than humans:

animal_lifespans > human_lifespan
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Now we can use square brackets to only keep those animals that have greater lifespans than humans.

animals[animal_lifespans > human_lifespan]
## [1] "greenland_shark"    "galapagos_tortoise"

Functions

Artist: Allison Horst

\(\color{blue}{\text{"Everything that happens in R is a function."}}\) ~John M. Chambers

You can think of functions as little machines that (in most cases) process some input and create an output.

Input is everything that goes into a function:

  • arguments you can think of as (pre-determined) input types like a lever or numpad.

  • values you can think of as the various settings that the levers or numpads can have.

\[\text{function_name}(\color{orange}{\text{argument}}=\color{lightblue}{\text{value}})\]

The Star Producer

Let’s consider the following function (that does not exist unfortunately): A star_producer!

This little machine creates tiny hand-drawn stars depending on some input. It takes two arguments:

  • how_many tells the machine how many stars to produce
  • type tells the machine how the stars should look like (in this case the machine only supports "squiggly" stars but it could be upgraded in the future when we learn how to create our own functions later on)

Note: just like levers or numpads you also have to be careful what value type an argument expects. You can’t type numbers into a lever and a numpad doesn’t do much when you pull it (other than break). Similarly, a certain argument might require you to input a numeric value, or more complex R objects.

Get ?help

But how do you know which function takes what kind of values?

You can find the documentation of a function when you type in ?function_name. This should tell you about the arguments, how to use the function and what to expect from its output.

Example function: seq

Let’s try this with a function that is included in every fresh R installation: seq.

?seq

From the description we can learn that this function is used to

“[g]enerate regular sequences”.

Its first three arguments are this:

  • from, to: the starting and (maximal) end values of the sequence.

  • by: number: increment of the sequence.

So we specify from when to where our sequence should go and by which number it should increment.

Let’s first take a look at this within our machine allegory.

Image

If we would want to create a vector from 1 to 10 that increments by 1 we can simple specify the following input values for the arguments:

  • from: 1
  • to: 10
  • by: 1

This is how that looks like in code:

seq(from = 1, to = 10, by = 1)
##  [1]  1  2  3  4  5  6  7  8  9 10

Passing Values

Now there are two ways to pass values to functions in R:

  1. Pass by argument names (we already did this above!)
  2. Pass by argument position

In the former case, we specifically mention which arguments we want to pass our values to.

For that, it doesn’t matter in which order we pass our arguments.

So this is perfectly valid code that produces the same result as above even if it seems less intuitive:

seq(to = 10, by = 1, from = 1)
##  [1]  1  2  3  4  5  6  7  8  9 10

But: coders are lazy.

There is no need to always specify which argument you mean exactly when you can just match by position.

So our sequence example could just as well look like this:

seq(1, 10, 1)
##  [1]  1  2  3  4  5  6  7  8  9 10

And it works because the documentation tells us that the first three arguments are from, to, and by.

In the future you will see it often that people just leave out the arguments completely so it’s good to get used to it.

More examples: Mean and Median

In fact, most of the time functions have such intuitive structures that we often wouldn’t need to even look up the documentation (even though it’s advisable to do).

An easy function to use is mean which simply calculates the average of a numeric vectors.

Let’s try this with the animal_lifespans we created earlier.

mean(animal_lifespans)
## [1] 94.53333

Quite old mean value!

median(animal_lifespans)
## [1] 27

Well when we take the median instead of the mean we can see that this is due to high outliers (the median is of course more robust to extreme values).

There are of course many more functions in R and we will get to learn some of them during this workshop.

Creating our own functions

We can create our own function using the call: function().

We encode what is supposed to happen within curly brackets {}.

Here is the anatomy of a function:

\(\color{purple}{\text{name}}\) <- \(\text{function}(\color{orange}{\text{argument}})\){

\(\color{green}{\text{# function body}}\)

\(\color{lightblue}{\text{value}}\) <- \(\color{orange}{\text{argument}}\)

\(\text{return(}\color{lightblue}{\text{value}}\text{)}\)

}

  • \(\color{purple}{\text{Function name}}\):
    • An identifier by which the function is called
  • \(\color{orange}{\text{Argument(s)}}\):
    • Contains a list of values passed to the function
    • Can also contain a default value like this: argument = 1
  • \(\color{green}{\text{Function body}}\):
    • This is executed each time the function is called
  • \(\color{lightblue}{\text{Return value}}\):
    • Ends function call & sends the value back to the global environment

Square functions

Let’s define a function which squares numeric values.

square <- function(number_to_be_squared) { 
  
  squared_number <- number_to_be_squared^2

  return(squared_number)                  
  
}
square(4)
## [1] 16

Tip: In RStudio we can just type fun and enter after the popup and RStudio will just automatically generate a template for a function.

Dog to Human years function

Let’s create a function that is able to calculate dog years into human years. We call the function dog_to_human_years.

dog_to_human_years <- function(animal_years){

  human_lifespan <- 122.5
  dog_lifespan <- 24

  ratio <- human_lifespan/dog_lifespan

  human_years <- animal_years*ratio

  return(human_years)
}
dog_to_human_years(8)
## [1] 40.83333

Works perfect!

A quick note on errors and debugging

Artist: Allison Horst

So you encountered an error:

## Error: object of type 'closure' is not subsettable

Steps to take:

  • Try to understand what it says.

  • Google (or other search engines)

    • Search for the error
    • Search for what you were trying to do (add R or rstats)
  • If the error occurs in a function check whether you passed an object type that isn’t expected by it.

    • Checking the documentation can help here! Type: ?function_name
  • Ask for help! Create a reproducible example (reprex) and post to online communities

    • Stackoverflow
    • Twitter (use #rstats)
    • RStudio community (forum)
  • Ask a Large Language Model (LLM) like ChatGPT, Claude or Bard

    • LLMs can be extremely useful, just copy-paste your code and error

    • However, be careful because they will make up things: always double check!

  • And if all doesn’t work: it’s ok to to take a break and come back later!

Artist: Allison Horst

Exercises I

Now it’s your turn to write some R code!

Please open 02_exercises_I.Rmd for some fun exercises.